The digital data explosion drives demand for robust and reliable data storage media. Developing better
digital storage devices to accumulate the zettabytes (1 ZB = 10^21 bytes) of data that will be generated in
the near future is a big challenge. Given the limitations of present-day digital storage devices, it will
soon be difficult for data scientists to provide reliable, affordable, and dense storage media. As an
alternative, researchers have used natural storage media such as DNA, bacteria, and proteins as information
storage systems. This article discusses DNA based information storage systems in detail, along with an
overview of bacterial and protein data storage systems.

1. INTRODUCTION

With the extensive use of social networking and cloud computing, there is a paradigm
shift in the volume of data produced. It is estimated that by 2020, 35 zettabytes of
digital information will be generated [Gantz and Reinsel 2010]. This rapid growth raises
serious concerns about storing and maintaining data and forces data storage experts to
design new storage architectures [Leong et al. 2012] [Du 2008]. From early storage media
such as rocks, stones, paper, punch cards, magnetic tapes, floppy disks, CDs, and DVDs to
modern distributed cloud data storage [Dimakis et al. 2010] [Dimakis et al. 2011]
[Bassoli et al. 2013], data storage devices have advanced drastically (as depicted in
Figure 1). But magnetic and optical discs are bulky, need regular maintenance, and are
prone to decay. They are also not environmentally friendly, as they consume vast amounts
of energy and release a lot of heat. Scientists are trying to miniaturize silicon chips
many times over, but this makes them more expensive. Alternatively, researchers have
turned to living sources from nature to preserve data, which gave birth to biostorage.
Biostorage is the field of storing and encrypting information in living cells or natural
media (see Fig. 2) [Baum 1995]. A comparison of natural storage media and digital
storage devices is given in [Mansuripur 2002]. Many natural storage media, such as
DNA, proteins, and bacteria, have been explored. This review paper showcases the evolution
of DNA based storage systems along with their information encoding methods in detail.
Though there was earlier evidence of reliable and scalable DNA based information systems,
recent works by Church et al. [Church et al. 2012], Goldman et al. [Goldman et al. 2013],
and [Grass et al. 2015] improved the efficiency of data encoding in DNA, which
indicates that this is the right time for the coding and information theory communities
to work on the challenges in natural data storage systems. Through this paper, one can see
the potential and challenges of DNA based information storage as well as of other natural
storage systems such as bacterial and protein based storage.

This review is structured as follows. Section 2 introduces the DNA based information
storage system. Section 3 discusses the Shannon information of DNA. Section 4 gives a
brief description of bacterial data storage. Section 5 describes protein as a hard drive.
The experimental evidence and challenges are then highlighted, and the paper concludes
with general remarks.



Fig. 1. Advancement in the field of data storage devices. The new paradigm of storing data in
DNA, proteins, and bacteria is indicated.


2. DNA AS STORAGE DEVICE 

DNA is a natural information storage molecule that stores our genetic information, and it
is a favoured solution for storing ample amounts of data. DNA stores genetic information
using four bases, A (adenine), C (cytosine), G (guanine), and T (thymine), analogous to a
digital storage device such as a CD, which stores information using lands and pits
represented as 0's and 1's on spiral tracks. The potential of DNA as a hard drive is well
described in [D'Onofrio and An 2010]. DNA is nature's hardware and has been used for
computation, which gave rise to DNA computation [Adleman 1994] [Deaton et al. 1998]
[Amos 1999] [Ezziane 2006] [Xu and Tan 2007] [Watada and binti Abu Bakar 2008].
DNA is stable, suits long term storage, requires no electricity, and needs no management
beyond keeping it in a cold and dark place. DNA has self ... [Battail 2006] [Gupta 2006]
[Milenkovic and Kashyap 2006] [Battail 2007] [Faria et al. 2012] [Faria et al. 2014], which
points to the capacity of DNA for error correction.

Fig. 2. Potential natural data storage media and their complexity.

... nucleotide base pairs. Initially DNA was used only to store text, but with the
advancements depicted in Figure 6, DNA can now be used to store any kind of data with 100
percent data accuracy. A schematic diagram of how one can store data in DNA is shown in
Fig. 3.

2.1. Encoding data in DNA 

Data encoded in DNA can be used for encryption [Bancroft and Clelland 2006] or long
term storage. Depending on the purpose, data can be embedded in non-coding DNA (nc-DNA),
protein coding DNA (pc-DNA), or synthetic DNA. One can represent each base pair using
2 bits, which gives 4 different values that can be mapped to the four possible base pairs
(for instance 00 -> AT, 01 -> GC, 10 -> TA, and 11 -> CG). A single byte (8 bits) can
therefore represent 4 DNA base pairs. The entire diploid human genome can be expressed in
bytes as follows: 6 x 10^9 base pairs/diploid genome x 1 byte/4 base pairs = 1.5 x 10^9
bytes, or 1.5 gigabytes [Grigoryev 2012]. If we calculate the data that can be stored in a
human body, taking the body to consist of 100 trillion cells, we obtain approximately 150
zettabytes (150 x 10^12 x 10^9 bytes) of data stored in the DNA of any human. In the 21st
century, many researchers have translated English text, mathematical equations
[Yachie et al. 2007], Latin text [Portney et al. 2008], and simple musical notations
[Ailenberg and Rotstein 2009] to DNA using different DNA coding principles
[Wong et al. 2003] [Arita and Ohashi 2004] [Skinner et al. 2007] [Yamamoto et al. 2008]
[Heider and Barnekow 2008]. The main encoding approaches proposed for DNA based
information storage systems are described below and shown in Figure 4.
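
The 2-bit mapping and the capacity arithmetic above can be checked with a short sketch. It uses the example mapping from the text (00 -> AT, 01 -> GC, 10 -> TA, 11 -> CG); the function and variable names are illustrative assumptions, not part of any cited work.

```python
# Bit-pair to base-pair mapping taken from the example in the text.
BITS_TO_PAIR = {"00": "AT", "01": "GC", "10": "TA", "11": "CG"}
PAIR_TO_BITS = {pair: bits for bits, pair in BITS_TO_PAIR.items()}


def encode_bytes(data: bytes) -> list:
    """Turn each byte into four base pairs (2 bits per base pair)."""
    bits = "".join(f"{byte:08b}" for byte in data)
    return [BITS_TO_PAIR[bits[i:i + 2]] for i in range(0, len(bits), 2)]


def decode_pairs(pairs: list) -> bytes:
    """Invert encode_bytes: four base pairs back into one byte."""
    bits = "".join(PAIR_TO_BITS[pair] for pair in pairs)
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))


# Capacity arithmetic from the text: 4 base pairs per byte.
GENOME_BASE_PAIRS = 6e9   # diploid human genome
CELLS_IN_BODY = 100e12    # 100 trillion cells, as assumed in the text
genome_bytes = GENOME_BASE_PAIRS / 4        # 1.5e9 bytes, i.e. 1.5 gigabytes
body_bytes = genome_bytes * CELLS_IN_BODY   # about 1.5e23 bytes, i.e. 150 zettabytes

print(encode_bytes(b"A"))  # ['GC', 'AT', 'AT', 'GC']
```

The byte 'A' is 01000001 in binary, so it maps to the four base pairs GC, AT, AT, GC.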

Fig. 3. Schematic representation of how to store data in DNA. The types of errors that may occur during
DNA based data storage are depicted in the figure.

2.1.1. Microvenus and Genesis project. The Microvenus project was initiated by Joe Davis to
encode an image in DNA, which introduced the idea of storing abiotic data in DNA.
Microvenus [Davis 1996] is a small organism carrying a short piece of synthetic DNA that
encodes a visual icon in the bacterium E. coli. Data encoding was done according to the
molecular size of the bases: C, the smallest base, was assigned 1, T -> 2, A -> 3, and
G -> 4. Instead of numeric values, each nucleotide was assigned a phase structure:
C -> X, T -> XX, A -> XXX, G -> XXXX. The encoding placed one nucleotide for each run
of repeated bits (0s or 1s), chosen according to the number of repeated bits in the run.
For instance, 1001011 = CTCCT and 10101 = CCCCC. The Microvenus DNA was inserted into a
bacterial host cell using plasmids. The encoding scheme was neither accurate nor
efficient, and the resulting DNA is not uniquely decodable. In a subsequent year, another
form of DNA based data encoding named Genesis [Kac 1999], an artwork by Eduardo Kac, was
introduced. He created an artificial art gene consisting of digital DNA by converting
lines from the Bible into Morse code. The Morse code, denoted by dots (.) and dashes (-),
was converted to nucleotides by mapping dash (-) and dot (.) to T and C and replacing word
spaces and letter spaces with A and G. This gene was then inserted into fluorescent
E. coli bacteria. Both projects laid the foundation for storing data in DNA, but they lack
efficiency, accuracy, and good encoding and decoding methods.
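
The run-length rule of the Microvenus encoding can be sketched as follows. This is a minimal illustration that reproduces the two examples from the text; the function name and the handling of runs longer than four bits are assumptions.

```python
from itertools import groupby

# Run length -> base, following the size ordering C < T < A < G given in the text.
RUN_TO_BASE = {1: "C", 2: "T", 3: "A", 4: "G"}


def microvenus_encode(bits: str) -> str:
    """Emit one base per run of identical bits, chosen by the run length."""
    bases = []
    for _bit, run in groupby(bits):
        length = len(list(run))
        if length not in RUN_TO_BASE:
            # Assumption: runs longer than 4 are simply rejected in this sketch.
            raise ValueError(f"run of length {length} has no base assigned")
        bases.append(RUN_TO_BASE[length])
    return "".join(bases)


print(microvenus_encode("1001011"))  # CTCCT
print(microvenus_encode("10101"))    # CCCCC
print(microvenus_encode("01010"))    # CCCCC, the same DNA for a different bit string
```

The last two calls show why the code is not uniquely decodable: the bases record only run lengths, not whether the first run consists of 0s or 1s.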


Fig. 4. PCR based, alignment based, and primer library based encoding models. (a) The PCR based encoding
model is a template based method consisting of forward and reverse primers between which the information
is inserted. Decoding is done by PCR amplification using these primers as the template. (b) Types of
error: a base flip or break point is indicated by B, a deletion by D, and an insertion of a base by I.
(c) The alignment based encoding model. Here the data is converted into four different DNA chunks C1, C2,
C3, and C4 using the encoding principle, and the DNA chunks are inserted at different loci in the genome.
(d) For decoding, the genome is sequenced and a multiple alignment of the different loci is performed on
the duplicated information inserted in multiple copies into the genomic DNA. (e) Encoding of data in
different plasmids at various locations of the bacterial genome. Data can be retrieved by sequencing the
index primers followed by sequencing of the plasmid library.

2.1.2. PCR based encoding models. In the Clelland encoding model [Clelland et al. 1999],
microdots were used to hide data in human genomic DNA. The secret message was inserted
between the PCR (polymerase chain reaction) forward and reverse primer sequences, called
template regions. The idea was to encode characters by a simple assignment of a DNA codon
to each character, which serves as the encryption key, and to insert the result into human
genomic DNA. Of the 64 possible codons, one is assigned to each English character and the
rest can be used to encode symbols such as dots and commas. This was carried out in vitro
by combining the message DNA with genomic DNA in solution over a 16-point microdot printed
on filter paper. Decoding was based on the template regions using PCR amplification. The
recipient must know the encryption key and the primer sequences. The stored data was
secure, but a major limitation was scalability, given the limited size of microdots: only
136 bits of data were encoded using this approach. To make it more accurate, Bancroft
[Bancroft et al. 2001] proposed the concept of information DNA (iDNA), which comprises the
information and a single polyprimer key (PPK) along with forward and reverse primers and a
common spacer of 5-6 bases to indicate the stored information. This concept resembles the
retrieval of information from an addressable storage device such as the random access
memory in a computer, where the PPK acts as a data location identifier. Data encoding was
done by mapping ternary codes to only three bases, A, C, and T, whereas the sequencing
primers were designed with all four bases, with the requirement that every fourth position
be a G to prevent mispriming. Decoding is done by sequencing the PPK first to recover the
forward and reverse primers; then, using the specific sequencing primer, one can retrieve
the information. The total data encoded using this approach was 561 bits. The drawbacks of
PCR based methods are the requirement for PCR, the need to know the primers, and extensive
experimental hurdles and practical issues. Moreover, the main drawback of PCR based
methods is that errors inserted in the template regions make retrieval of the encoded data
impossible.
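
The template idea behind the Clelland model can be illustrated with a small sketch: each character is replaced by a codon, and the message is placed between a forward and a reverse primer. The codon table and both primer sequences below are hypothetical placeholders, not the ones used by Clelland et al., and a plain string search stands in for PCR amplification.

```python
from itertools import product
import string

# Hypothetical codon table: the i-th character gets the i-th codon in
# alphabetical order. The table used by Clelland et al. is not reproduced here.
CHARACTERS = string.ascii_uppercase + " ,."
CODONS = ["".join(triple) for triple in product("ACGT", repeat=3)]
CHAR_TO_CODON = dict(zip(CHARACTERS, CODONS))
CODON_TO_CHAR = {codon: char for char, codon in CHAR_TO_CODON.items()}

# Hypothetical primer sequences that flank the message (the template region).
FORWARD_PRIMER = "TCCCTCTTCGTCGAGTAGCA"
REVERSE_PRIMER = "GGTAGCCTTAGCGATTCGGA"


def hide_message(message: str) -> str:
    """Encode the message as codons and flank it with the two primers."""
    body = "".join(CHAR_TO_CODON[char] for char in message.upper())
    return FORWARD_PRIMER + body + REVERSE_PRIMER


def recover_message(strand: str) -> str:
    """Locate the primers (a stand-in for PCR) and decode the codons between them."""
    start = strand.find(FORWARD_PRIMER)
    end = strand.rfind(REVERSE_PRIMER)
    if start < 0 or end < 0:
        raise ValueError("primer sequences not found")
    body = strand[start + len(FORWARD_PRIMER):end]
    return "".join(CODON_TO_CHAR[body[i:i + 3]] for i in range(0, len(body), 3))
```

A recipient who lacks either the primers or the codon table cannot recover the message, which mirrors the constraint noted in the text. A single corrupted base inside a primer also makes `recover_message` fail, matching the main drawback of PCR based methods.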


2.1.3. Alignment based encoding models. Yachie et al. [Yachie et al. 2007] [Yachie et al.
2008] were the first to introduce PCR independent, alignment based data encryption, using
an encoding scheme of four bits per two bases. In this approach, multiple sequence
alignment was used to recover information encoded in the genomic DNA of B. subtilis. Text
was converted to keyboard scan codes and then to hexadecimal code, and the resulting
binary code was converted to nucleotides with a designed mapping of four bits per two
bases. The DNA oligonucleotide sequences carrying duplicated information were termed
cassettes, and each cassette was inserted redundantly into multiple loci of the Bacillus
subtilis genome. Data can be retrieved by genome sequencing followed by multiple sequence
alignment of the bit data sequences, without the need for template DNA or parity checks.
The main drawback of alignment based encoding was the size limit of the cassette
oligonucleotides used to encode the message. If a cassette exceeds a certain length, it
may occur by chance in the host genome. In addition, complete genome sequencing was needed
to retrieve the data. All the models described so far encoded text data in DNA, but in
2009 Ailenberg et al. [Ailenberg and Rotstein 2009] proposed an improved Huffman coding
with a unique primer design and, for the first time, encoded text, images, and musical
characters in DNA. They employed a modified base-3 Huffman code, dividing the keyboard
characters into three groups of DNA codons and assigning DNA bases according to the
frequency of their occurrence. The DNA sequences were synthesised and inserted into
plasmids, and decoding was done by sequencing the plasmids. First, an index plasmid
containing information such as title, authors, plasmid number, and primer assignments is
constructed. They used specific primers with a unique prefix code for each type of file;
for instance, text data was initiated by "tx" and music data by "mu". Data was inserted
into a plasmid with a unique sequencing primer for information retrieval. They used
nucleotides efficiently, encoding 4.9 bases per character, and encoded 1688 bits of data.
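
The four-bits-per-two-bases idea and the benefit of redundant copies can be sketched as below. The hexadecimal-to-dinucleotide table is a hypothetical one, ASCII bytes are used in place of keyboard scan codes, and a column-wise majority vote over equal-length copies stands in for a real multiple sequence alignment (so it handles substitutions only).

```python
from collections import Counter
from itertools import product

# Hypothetical table: hex digit i -> i-th dinucleotide in alphabetical order.
DINUCLEOTIDES = ["".join(pair) for pair in product("ACGT", repeat=2)]
HEX_DIGITS = "0123456789abcdef"
HEX_TO_DINUC = dict(zip(HEX_DIGITS, DINUCLEOTIDES))
DINUC_TO_HEX = {dinuc: digit for digit, dinuc in HEX_TO_DINUC.items()}


def encode_text(text: str) -> str:
    """Four bits (one hex digit) become two bases."""
    return "".join(HEX_TO_DINUC[digit] for digit in text.encode("ascii").hex())


def decode_dna(dna: str) -> str:
    digits = "".join(DINUC_TO_HEX[dna[i:i + 2]] for i in range(0, len(dna), 2))
    return bytes.fromhex(digits).decode("ascii")


def consensus(copies: list) -> str:
    """Column-wise majority vote over equal-length copies (substitutions only)."""
    return "".join(Counter(column).most_common(1)[0][0] for column in zip(*copies))
```

With three or more copies, a substitution in any single copy is outvoted by the others, which is why no parity checks are needed in this setting.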


2.1.4. Church and Goldman encoding model. Although the aforementioned works were
cornerstones for storing data in DNA, they were successful only on a small scale, as they
encoded small amounts of data. The most rewarding work was done in recent times by Church
et al. at Harvard University [Church et al. 2012]. Using next generation synthesis and
sequencing technology, Church came up with an efficient one-bit-per-base algorithm for
encoding information bits into DNA chunks of fixed length. Flanking primers were inserted
at the beginning and end of the information data to identify the specific DNA segment in
which particular data was encoded [Church et al. 2012]. They encoded an entire book
(Regenesis: How Synthetic Biology Will Reinvent Nature and Ourselves, ISBN-13:
978-0465021758), including 53,426 words, 11 JPG images, and one JavaScript program, into
54,898 oligos, each 159 nucleotides (nt) in length and consisting of a 96-bit data block
(96 nt), a 19-bit address (19 nt) specifying the data block location, and flanking 22 nt
common sequences to facilitate amplification and sequencing. First, the book was converted
into HTML format including all its images. The individual bits were then converted to a
DNA sequence at one bit per base, with A or C for 0 and T or G for 1. The bases were
selected randomly while avoiding homopolymer runs longer than 3 and keeping the GC content
balanced. The data blocks were indexed by a 19-bit barcode of consecutive numbers starting
from 0000000000000000001, which determines the location of the encoded bits within the
book. Each data block held 12 bytes (96 bits) without the barcode, and the total amount of
data encoded was 5.27 megabits (54,898 blocks x 96 bits). A specific primer sequence of 22
nucleotides was designed for sequencing, and the oligos were amplified using PCR. The
sequences were read using an Illumina HiSeq next generation sequencer. In writing and
reading the DNA, 10 bit errors occurred in the 5.27 megabits. The only drawback was the
lack of an error correction scheme, which was addressed by Goldman with 100 percent data
retrieval.
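
A minimal sketch of the one-bit-per-base rule is given below: 0 becomes A or C, 1 becomes G or T, and the random choice is constrained so that no base repeats more than three times in a row. GC balancing, the address barcode, and the primers are left out, and the function names are illustrative assumptions.

```python
import random

ZERO_BASES = "AC"   # a 0 bit is written as A or C
ONE_BASES = "GT"    # a 1 bit is written as G or T
MAX_RUN = 3         # longest homopolymer run allowed


def church_encode(bits: str, rng: random.Random) -> str:
    """Encode one bit per base, never extending a run beyond MAX_RUN."""
    sequence = []
    for bit in bits:
        candidates = ZERO_BASES if bit == "0" else ONE_BASES
        tail = sequence[-MAX_RUN:]
        if len(tail) == MAX_RUN and len(set(tail)) == 1:
            # The last three bases are identical, so exclude that base.
            candidates = [base for base in candidates if base != tail[0]]
        sequence.append(rng.choice(candidates))
    return "".join(sequence)


def church_decode(sequence: str) -> str:
    """A or C reads as 0; G or T reads as 1."""
    return "".join("0" if base in ZERO_BASES else "1" for base in sequence)


# Oligo layout quoted in the text: primer + address + data block + primer.
OLIGO_LENGTH = 22 + 19 + 96 + 22   # 159 nt
TOTAL_BITS = 54_898 * 96           # 5,270,208 bits, about 5.27 megabits
```

Because each bit has two candidate bases, one of them always remains available after the repeated base is excluded, so the encoder never gets stuck.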

Fig. 5. Stepwise encoding of data into DNA using Goldman's approach: binary file, base 3 encoding, DNA
encoding, DNA fragments, and indexing information added to locate each DNA fragment. Binary data is
converted to a base-3 Huffman code, which is then converted to DNA sequences. Each DNA sequence is split
into fragments that overlap by 75 bases, which adds fourfold redundancy, and every alternate fragment is
reverse complemented.

In 2013, Goldman et al. modified the one-bit-per-base system introduced by Church by
employing an improved base-3 Huffman code (trits 0, 1, and 2). The original file in binary
code (0, 1) is converted to a ternary code (0, 1, 2), which is in turn converted to a DNA
code. The procedure involves four steps, shown in Fig. 5. The binary digits holding the
ASCII codes were converted to a base-3 Huffman code that replaces each byte with five or
six base-3 digits (trits). Each trit was encoded as one of the three nucleotides different
from the previous one used, to avoid the homopolymers that cause errors in the synthesis
of DNA. The DNA sequence was divided into chunks that overlap by 75 bases, which gives
fourfold redundancy to recover data lost during synthesis and sequencing of the DNA. As a
further safeguard, every alternate chunk was converted to the reverse complement of the
strand. Indexing information was appended to each chunk to determine the location of the
segment in the overall data, and one parity check was added to detect errors in the
intra-file location, giving strings of 117 bases. In total, 153,335 DNA strings were
generated. A 33-nucleotide primer was added to facilitate synthesis and amplification. For
details, the reader is referred to [Goldman et al. 2013]. As a proof of concept, they used
four different file types (739 kilobytes in total) and achieved a storage density of
2.2 PB/g of DNA.
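
The two core steps of this scheme, homopolymer-free trit encoding and overlapping fragments with alternate reverse complements, can be sketched as follows. The rotation table, the starting base, and the 100-base segment length with a 25-base step (which yields the 75-base overlap) are assumptions chosen for illustration; indexing and parity information are omitted.

```python
# For each previous base, the three candidate next bases indexed by trit value.
# Assumed rotation table; any table that never repeats the previous base works.
NEXT_BASE = {"A": "CGT", "C": "GTA", "G": "TAC", "T": "ACG"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def trits_to_dna(trits: list, start: str = "A") -> str:
    """Each trit picks one of the three bases that differ from the previous base."""
    previous, bases = start, []
    for trit in trits:
        previous = NEXT_BASE[previous][trit]
        bases.append(previous)
    return "".join(bases)


def dna_to_trits(dna: str, start: str = "A") -> list:
    previous, trits = start, []
    for base in dna:
        trits.append(NEXT_BASE[previous].index(base))
        previous = base
    return trits


def reverse_complement(dna: str) -> str:
    return dna.translate(COMPLEMENT)[::-1]


def overlapping_segments(dna: str, length: int = 100, step: int = 25) -> list:
    """Cut overlapping segments; every alternate one is reverse complemented."""
    segments = []
    for index, start in enumerate(range(0, len(dna) - length + 1, step)):
        segment = dna[start:start + length]
        segments.append(reverse_complement(segment) if index % 2 else segment)
    return segments


print(trits_to_dna([0, 1, 2, 0, 0]))  # CTGTA
```

With a step of 25 and a length of 100, every interior base appears in four segments, which is the fourfold redundancy described above.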


2.2. Error correction in DNA based information systems 

There are two types of errors associated with DNA data storage systems: physical errors
and genetic errors. Physical errors occur during the synthesis and sequencing of DNA,
whereas genetic errors are caused by mutations, which occur naturally during evolution and
propagation. An error can be the insertion, deletion, or substitution of a single base in
a DNA sequence. The substitution of a single base can be regarded as a bit flip error.
Another type of error is the deletion of a run of DNA nucleotides, categorized as a burst
error. Reading error rates range from 1 to 3%, while writing error rates reach up to 15%.
The error models proposed so far have focused on physical errors such as substitution and
deletion of oligonucleotides [Haughton and Balado 2011] [Kiah et al. 2014], but no work
has been done on an insertion error model.

ACM Journal on Emerging Technologies in Computing Systems, Vol. V, No. N, Article A, Pub. date: January YYYY.

D Limbachiya et al.
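
The error types described above can be made concrete with a short sketch that applies each of them to the example strand shown in Fig. 3. The helper names are illustrative assumptions.

```python
def substitute(sequence: str, position: int, base: str) -> str:
    """Replace one base (the DNA analogue of a bit flip)."""
    return sequence[:position] + base + sequence[position + 1:]


def insert(sequence: str, position: int, base: str) -> str:
    """Insert one extra base before the given position."""
    return sequence[:position] + base + sequence[position:]


def delete(sequence: str, position: int, length: int = 1) -> str:
    """Delete one base, or a run of bases (a burst error) when length > 1."""
    return sequence[:position] + sequence[position + length:]


original = "ATGTCGATCGACT"
print(substitute(original, 2, "C"))  # ATCTCGATCGACT
print(insert(original, 3, "C"))      # ATGCTCGATCGACT
print(delete(original, 4, 3))        # ATGTTCGACT
```

Substitutions keep the strand length unchanged, while insertions and deletions shift every later base, which is why they are harder to model and correct.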

There are three basic codes for storing data in DNA [Arita 2004]: the Huffman code, the
comma code, and the alternating code. Although the comma code and the alternating code are
robust and can correct small-scale damage such as DNA point mutations, they cannot recover
a broken data block from the data-encoded DNA region. Such breakpoints can be handled by
Huffman coding [Yachie et al. 2008]. DNA encoding with Huffman codes builds on the Huffman
coding method [Huffman 1952], which is uniquely decodable. In this method, the
probabilities of the symbols are considered (here the symbols are the letters of the
English alphabet). The least probable symbols are combined to form a new symbol, and the
process is repeated until every symbol has a unique code. For a base-2 Huffman code, the
two smallest probabilities are added and reduced to generate a compact code until all
symbols are coded with 0s and 1s. Likewise, for base-3 and base-4 Huffman codes, the three
and four least probable symbols, respectively, are reduced to a compact code. The Huffman
code described in [Smith et al. 2003] used the probabilities of the letters of the English
alphabet. Accordingly, the most frequent letter, e, has a single-base code and the least
frequent, z, has a code length of 5 bases. For this base 2 Huffman code, the two lowest
frequencies are summed to give the compact code. The codes are assigned by varying the
wobble position (the third position of the codon) for letters with similar probabilities.
The average code length is 2.2 bases, which is shorter than for other codes. The drawback
of the method is that it does not include symbols and other characters. An improved
Huffman code covering all English letters and special characters was described using a
specific base assignment with uniquely designed primer sequences [Ailenberg and Rotstein
2009]. The Huffman method was also used by Goldman and his colleagues [Goldman et al.
2013], together with error correction techniques.
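
The merging procedure described above generalizes to any base. The sketch below builds a base-3 Huffman code with the standard n-ary construction, padding with zero-weight dummy symbols so that every merge combines exactly three nodes. The symbol frequencies in the usage example are made up for illustration and are not the English letter statistics used in the cited works.

```python
import heapq
from itertools import count


def huffman_code(frequencies: dict, digits: str = "012") -> dict:
    """Build an n-ary Huffman code, where n is the number of digits."""
    n = len(digits)
    order = count()  # tie-breaker so that heap entries never compare dicts
    heap = [(weight, next(order), {symbol: ""})
            for symbol, weight in frequencies.items()]
    # Pad with zero-weight dummies so every merge can take exactly n nodes.
    while (len(heap) - 1) % (n - 1) != 0:
        heap.append((0, next(order), {}))
    heapq.heapify(heap)
    while len(heap) > 1:
        merged, total = {}, 0
        for digit in digits:
            weight, _, codes = heapq.heappop(heap)
            total += weight
            for symbol, code in codes.items():
                merged[symbol] = digit + code  # prepend the branch digit
        heapq.heappush(heap, (total, next(order), merged))
    return heap[0][2]


# Made-up frequencies: the most frequent symbol gets the shortest code.
example = huffman_code({"e": 50, "t": 20, "a": 15, "z": 10, "q": 5})
```

In this example, `e` and `t` receive one-trit codes while `a`, `z`, and `q` receive two-trit codes, and no code is a prefix of another, so the code is uniquely decodable. The trits could then be written as bases, for example with a rule that avoids repeating the previous base.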

Recently, the DNA storage channel was described by Han Mao Kiah et al., who represented
the reading and writing errors of DNA data storage as profile vectors and designed a
family of error correcting codes for synthesis and sequencing errors [Kiah et al. 2014].
They developed a codeword design technique that yields codes with sufficiently large
distance for error correction. DNA has properties essential for long term archival of data
compared to digital storage devices. To demonstrate long term storage in DNA and improve
DNA stability, researchers have developed a chemical method [Grass et al. 2015] to
encapsulate DNA in glass spheres and protect it from environmental damage for long term
archival. They used a prominent error correcting code, the Reed-Solomon code, which is
also used in digital storage devices such as CDs and DVDs, to implement two layers of
encoding, one at the DNA chunk level and the other for the synthesis and sequencing of
DNA. This method can correct burst errors. Recently, a rewritable DNA based data storage
system was proposed in [Yazdi et al. 2015], which uses a unique addressing scheme that
allows random access to the data, unlike previous techniques in which random access was
not possible.

3. SHANNON INFORMATION OF DNA 


Fig. 6. Time-line for DNA based data storage systems (Yachie et al. 2007; Ailenberg et al. 2009;
Gibson et al. 2010; Church et al. 2012; Goldman et al. 2013).

Table I. Comparative study of the encoding schemes used to encode data in DNA with their limitations and constraints

Source of storage | Encoding Scheme | Purpose | Limitation | Constraints
Microdots | 4 base encoding using PCR primers and encryption key | Data security and privacy | No error correction used | Knowledge of PCR primer sequences and encryption is mandatory
iDNA | Poly primer sequence with designed encryption scheme | First implementation for actual data stored in DNA using Microdot experiment | Security was not considered and not reliable under adverse environmental conditions | Information can be lost; designing of encryption key
Data encoded in recombinant DNA inserted in vector | 64 codon sequences mapped to ASCII characters uniquely | Safeguard the data encoded in DNA inserted in vectors which can resist adverse conditions like radiation and extreme temperature | Very few base pairs of information can be encoded due to the genome size limitation of the vector (E. coli and D. radiodurans) used | Preparation of recombinant DNA and designing vectors with proper sentinels
Plasmid library | Huffman coding of English alphabets | Storage of different file formats in DNA | Uses a 7 base encoding scheme which may be ambiguous in decoding the data | Designing of specific primer sequences
Synthetic DNA | One bit per base: 0 - A/C and 1 - G/T | Successfully stored and retrieved data from DNA | Lack of effective error correction | Ambiguity in the sequences
Synthetic DNA | Huffman code of base 3 for ASCII characters | Scaling the amount of data stored in DNA with effective error correction | Time consuming for larger data | Knowledge of synthesis and sequencing of DNA

Shannon information [Shannon 2001] for DNA is the number of bits per DNA base. The
theoretical limit of the Shannon capacity of DNA is 455 exabytes per gram of DNA. It can
be derived by considering 2 bits per nucleotide of single stranded DNA and an average
molecular weight of 330.05 g/mol per nucleotide. The Shannon capacity of DNA is obtained
by calculating the weight per bit (2.74 x 10^-22 gram per bit) and then the number of
bytes in one gram of DNA (1 / (2.19 x 10^-21) = 4.55 x 10^20 bytes per gram of DNA).
Different models based on errors have been suggested for the Shannon capacity of DNA. One
prominent idea about the capacity of DNA was described by Vinhthuy Phan et al. [Phan and
Garzon 2005] in terms of a hybridization model. They stated that it is very difficult to
estimate the Shannon capacity of DNA of a given length for storing data. They considered
that hybridization can occur only if the DNA sequences in a set are at some distance, with
a parameter r for the reaction stringency, and gave a lower bound on the capacity of DNA
to store abiotic data. Other models considered mutations in DNA sequences to estimate the
Shannon capacity. The capacity of DNA under substitution errors [Balado 2010] and under
insertion and deletion errors [Balado 2013] was proposed by F. Balado. Considering the
encoding of data in the non-coding or coding regions of the genome, an upper limit on the
Shannon capacity for the amount of data stored in DNA at a given error rate was specified.
The upper bound on the DNA storage capacity without errors is 2 bits per base. The
encoding methods used for embedding data in DNA have achieved Shannon information
densities ranging from 2 bits per base [Wong et al. 2003] to 0.213 bits per base [Heider
and Barnekow 2008] [Ailenberg and Rotstein 2009] and 0.096 bits per base [Arita and Ohashi
2004]. Goldman achieved a Shannon information of 1.58 bits per base for each DNA string.
There remain many open opportunities for designing optimal capacity-achieving codes for
DNA storage with better code rates and DNA chunk lengths.
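
The capacity figure quoted above can be reproduced with a few lines of arithmetic. The sketch uses the values given in the text (2 bits per nucleotide, 330.05 g/mol per nucleotide) together with Avogadro's number; the variable names are illustrative assumptions.

```python
import math

AVOGADRO = 6.02214076e23      # molecules per mole
NUCLEOTIDE_MASS = 330.05      # g/mol per nucleotide, as quoted in the text
BITS_PER_NUCLEOTIDE = 2       # log2(4) for the alphabet {A, C, G, T}

grams_per_bit = NUCLEOTIDE_MASS / AVOGADRO / BITS_PER_NUCLEOTIDE  # about 2.74e-22
grams_per_byte = 8 * grams_per_bit                                # about 2.19e-21
bytes_per_gram = 1 / grams_per_byte                               # about 4.56e20
exabytes_per_gram = bytes_per_gram / 1e18

# A ternary code that never repeats the previous base carries log2(3) bits per base.
ternary_bits_per_base = math.log2(3)                              # about 1.58
```

The result is roughly 456 exabytes per gram, in line with the 455 exabytes quoted above once rounding of the intermediate values is taken into account. The last line shows where the 1.58 bits per base of a three-way choice per base comes from.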


4. BACTERIAL HARD DRIVE 

J. Cox [Cox 2001] suggested that suitable hosts for storing data in DNA are Bacillus
subtilis and Saccharomyces cerevisiae (baker's yeast). Yeast offers higher density than
bacteria but is practically challenging. Some bacteria can survive harsh conditions such
as nuclear radiation, high temperature, and life deep under soil and water. Bacteria that
could be used for data storage include B. subtilis, M. magneticum, D. radiodurans, and
Mycoplasma genitalium. The idea of using bacteria as storage was pioneered by Masaru
Tomita and Yachie [Yachie et al. 2007], who stored Einstein's famous relativity equation,
E = mc^2, in the soil bacterium B. subtilis by repeatedly inserting the message DNA into
multiple loci of the bacterium's genomic DNA. They could encode 120 bits in the 4.2 Mb
genome of the bacterium and decoded it by multiple sequence alignment.

Remarkable work was done by a team of students at the Chinese University of Hong Kong
using E. coli as the medium to store data. One gram of bacteria has a storage capacity
equivalent to 450 hard disks of 2,000 gigabytes each. According to the team, this
technology has many advantages over magnetic data storage media, mainly that the
information cannot be hacked and can withstand cyber attacks, which points to higher data
security than computer storage. An encoding system takes the original data, turns it into
a quaternary number, and then encodes it as a DNA sequence by mapping 0, 1, 2, 3 to A, T,
C, G respectively, with a storage capacity of 1 Kb per cell. The data was compressed using
the deflate algorithm. This lossless data compression technique is important for two
reasons: it increases the information storage capacity, and it avoids homopolymers and
repetitive regions in the DNA. The information was broken down into fragments, each
consisting of a header sequence, the message DNA, and a checksum. To retrieve the data, a
novel biological information processing system was developed. Encryption is achieved
through DNA sequence shuffling by the Rci recombination system, which uses site specific
recombination by the recombinase (Rci) gene. They mapped the DNA using restriction enzymes
so that data can be addressed just like in a filing system on magnetic storage. Live
bacterial cells are used for data storage, and they work like the transistors in
electronic devices, which have on and off states. A memory device was designed that
instructs the cells when to start and stop dividing. Such devices could be useful in the
treatment of cancer and other diseases.
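
The encoding pipeline described above (deflate compression, conversion to quaternary digits, mapping 0, 1, 2, 3 to A, T, C, G, and fragmentation) can be sketched with Python's standard `zlib` module. The fragment layout, with a numeric header and a CRC-32 checksum, and the payload size are assumptions made for illustration; the team's actual fragment format is not given in the text.

```python
import zlib

DIGIT_TO_BASE = "ATCG"  # quaternary digits 0, 1, 2, 3 -> A, T, C, G


def bytes_to_dna(data: bytes) -> str:
    """Write each byte as four quaternary digits, most significant first."""
    return "".join(DIGIT_TO_BASE[(byte >> shift) & 3]
                   for byte in data for shift in (6, 4, 2, 0))


def dna_to_bytes(dna: str) -> bytes:
    digits = [DIGIT_TO_BASE.index(base) for base in dna]
    return bytes((digits[i] << 6) | (digits[i + 1] << 4)
                 | (digits[i + 2] << 2) | digits[i + 3]
                 for i in range(0, len(digits), 4))


def encode_message(text: str) -> str:
    """Compress with deflate (zlib), then map the bytes to bases."""
    return bytes_to_dna(zlib.compress(text.encode("utf-8")))


def decode_message(dna: str) -> str:
    return zlib.decompress(dna_to_bytes(dna)).decode("utf-8")


def make_fragments(dna: str, payload_size: int = 40) -> list:
    """Split into fragments with a hypothetical header and checksum."""
    return [{"header": index,
             "payload": dna[start:start + payload_size],
             "checksum": zlib.crc32(dna[start:start + payload_size].encode("ascii"))}
            for index, start in enumerate(range(0, len(dna), payload_size))]
```

The header lets the fragments be reassembled in order, and the checksum lets a reader detect a corrupted fragment before decompression is attempted.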

Storing data in bacteria was a successful attempt, but creating rewritable storage
remained a challenge. It was solved by researchers at Stanford University, who developed a
rewritable recombinase addressable data (RAD) module to store and rewrite digital
information [Bonnet et al. 2012]. With the help of enzymes, one can modify DNA at a
specific site and exchange DNA sequences at a specific location. This is done by
recombinase enzymes, which allow strand exchange between site specific DNA sequences
[Grindley et al. 2006]; recombinase-mediated DNA inversion [Ham et al. 2008] mimics the
flipping behaviour of a bit. The RAD module performs inversion of DNA by integrase and
excisionase, which gives the system its bidirectional behaviour. It has two
transcriptional input signals, named set and reset. Set controls the expression of
integrase, which flips the DNA element serving as the data register. The other input,
reset, drives the expression of integrase along with excisionase as a co-factor, which
restores the orientation of the DNA element. The result is a DNA register that stores two
states, like a finite automaton, and can be flipped by successive input signals. Here the
states 0 and 1 are indicated by green and red fluorescent proteins, respectively. The
state of the register is read from the orientation of the specific DNA sequence. By tuning
and integrating the expression of the recombinases, they developed the first reliable and
rewritable DNA inversion-based data storage system.
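
The logical behaviour of the RAD register can be modelled as a two-state automaton driven by the set and reset signals. The sketch below is only an abstraction of that logic; the class name is hypothetical, and it assumes that repeating a signal leaves the state unchanged.

```python
class RadRegister:
    """Two-state model of a DNA inversion register (0 = green, 1 = red)."""

    def __init__(self) -> None:
        self.state = 0  # original orientation of the DNA element

    def apply(self, signal: str) -> int:
        if signal == "set":
            # Integrase alone inverts the DNA element.
            self.state = 1
        elif signal == "reset":
            # Integrase together with excisionase restores the orientation.
            self.state = 0
        else:
            raise ValueError(f"unknown signal: {signal}")
        return self.state

    def reporter(self) -> str:
        """Fluorescent colour that reports the current state."""
        return "green" if self.state == 0 else "red"


register = RadRegister()
history = [register.apply(signal) for signal in ["set", "set", "reset", "set"]]
```

After the four signals above, `history` is `[1, 1, 0, 1]` and the reporter colour is red.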

Another perspective on bacterial data storage came from researchers at Britain's
University of Leeds and Japan's Tokyo University of Agriculture and Technology. They
used the bacterium Magnetospirillum magneticum to organically grow tiny magnets which
can store bits of data [Tanaka et al. 2012]. Magnetic storage devices are built by
cutting large magnets down into tiny magnets and embedding them onto storage media
such as tapes and hard disks. Instead of cutting magnets, the scientists thought of
creating tiny magnets from a natural source that can be used to store data. This
bacterium can ingest iron and produce crystals of the mineral magnetite. The
researchers studied this mechanism and tried to mimic it outside the body of the
bacterium. They arranged the magnetic-particle-specific iron-binding protein Mms6
[Galloway et al. 2012b] in a chessboard pattern and dipped it in an iron solution
[Tanaka et al. 2011]. This experiment resulted in the growth of tiny magnets that may
be used as a potential material to build storage devices [Galloway et al. 2012a]. Each
nano cube can store one bit of information, but at 20000 nanometers wide it is much
larger than the 10 nm magnetic pits of a magnetic storage device. The researchers are
now working on miniaturizing the nano magnets and on using alternative magnetic
materials to develop a single-array nano magnet that can store one bit of information.
In spite of these success stories for bacterial data storage, many problems remain to
be addressed. The main difficulties are the design of error models for bacterial data
storage, the development of error correction codes and the modeling of the Shannon
capacity of bacterial storage.
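
To put the size gap quoted above into perspective, the short Python calculation below
compares a 20000 nm bit cell with a 10 nm one. It is a back-of-the-envelope sketch that
assumes square bit cells packed without spacing, which the source does not state; the
function and variable names are illustrative.

```python
def bit_area_nm2(width_nm):
    """Area of a square bit cell with the given side length, in nm^2."""
    return width_nm ** 2


BACTERIAL_BIT_NM = 20_000    # width of one grown nano magnet, as quoted above
CONVENTIONAL_BIT_NM = 10     # width of one conventional magnetic pit, as quoted above

linear_ratio = BACTERIAL_BIT_NM / CONVENTIONAL_BIT_NM
area_ratio = bit_area_nm2(BACTERIAL_BIT_NM) / bit_area_nm2(CONVENTIONAL_BIT_NM)

print(f"linear size ratio: {linear_ratio:,.0f}x")
print(f"area per bit ratio: {area_ratio:,.0f}x")
```

Under these assumptions each grown magnet occupies the area of four million
conventional bits, which is why miniaturization is the immediate research goal.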

5. PROTEIN HARD DRIVE 

Proteins play a central role in the functioning of the body and store the behaviour of
a human in the form of folded chains of amino acids. Communication between various
organs takes place at a particular instant in response to specific proteins. To use
protein as a storage medium, it was necessary to identify proteins that can replicate
the binary storage technique. The study of photo-switching proteins has revealed many
applications in nanoscience, one of which is data storage [Sauer 2005]. The foundation
was laid by Hirshberg and colleagues, who proposed the first photochemical memory
model, based on the color transformation of molecules called spiropyrans, which can
flip to another form on absorption of a single photon [Hirshberg 1956]. Furthermore,
the use of photo-induced fluorescent proteins that can switch between two states for
data storage is described in [Tsien 1998]. The switching can be between two different
colors, such as red and green, or between a dark and a bright state. Researchers in
the area found a solution by studying a family of photochromic proteins, namely the
photoconvertible fluorescent proteins (PCFPs) and the reversibly switchable
fluorescent proteins (RSFPs), which are light-driven switchable fluorescent proteins
[Adam et al. 2010]. They could not only read and write data but also erase it and
rewrite it on the proteins, so this was the first remarkable attempt at creating
rewritable natural data storage. The PCFP Kaede [Ando et al. 2002] and the RSFP Dronpa
[Ando et al. 2004] were used to write and rewrite the data. Cis/trans isomerization of
the chromophore [Lukyanov et al. 2000], along with photo-induced protonation
[Adam et al. 2008] of the chromophore, is responsible for photo-switching between the
two states. The information can be stored in areas designated by green and red colors,
which correspond to 0 and 1. The state was determined using EosFP, a fluorescent
marker protein that undergoes UV-inducible green-to-red conversion. Different
materials, as described in [Adam et al. 2010], were used for surface coating of all
the proteins. To write data on the protein surface, an inverted laser-scanning
microscope with a particular specification was used. Reading, writing and erasing of
data were done using the laser beam at different intensity levels.

The idea of using single crystals of PCFP/RSFP proteins as a 3D storage medium has
been implemented, where each protein molecule in the crystal would represent a data
bit [Hell et al. 2007]. Here, instead of binary encoding, four-color fluorescence
switching proteins were used with a mutant of the RSFPs, which can build a 4-base data
storage system using the two-photon excitation (TPE) technique
[Mandzhikov et al. 1973]. The 4-base data storage was implemented using IrisFP, a
mutant of the PCFP EosFP [Adam et al. 2008], as it combines the properties of both
protein families: irreversible switching from green to red and reversible switching
from the dark to the bright state. This technique is more efficient as it focuses on a
precise location for the storage.
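
A 4-base scheme of this kind stores two bits per site instead of one. The Python sketch
below illustrates the idea under stated assumptions: the assignment of the four
(color, brightness) combinations to the symbols 0 to 3 is hypothetical and is not taken
from the cited work, and the model ignores the fact that the green-to-red conversion is
irreversible.

```python
# Hypothetical symbol table: each storage site combines a color and a brightness.
STATES = [
    ("green", "dark"),    # symbol 0 -> bits 00
    ("green", "bright"),  # symbol 1 -> bits 01
    ("red", "dark"),      # symbol 2 -> bits 10
    ("red", "bright"),    # symbol 3 -> bits 11
]


def encode(data: bytes):
    """Map every byte to four sites, most significant bit pair first."""
    sites = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            sites.append(STATES[(byte >> shift) & 0b11])
    return sites


def decode(sites):
    """Rebuild the bytes from a list of (color, brightness) sites."""
    if len(sites) % 4 != 0:
        raise ValueError("number of sites must be a multiple of four")
    out = bytearray()
    for i in range(0, len(sites), 4):
        byte = 0
        for site in sites[i:i + 4]:
            byte = (byte << 2) | STATES.index(site)
        out.append(byte)
    return bytes(out)


message = b"bR"
stored = encode(message)
print(len(stored), "sites for", len(message), "bytes")
print(decode(stored) == message)
```

With four states per site, a byte occupies four sites rather than the eight needed by a
two-state (green/red or dark/bright) encoding.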

Another idea was implemented by the group of Venkatesan Renugopalakrishnan at Harvard
University, using the protein bacteriorhodopsin (bR) for data storage. The concept of
using bR was spearheaded by Jack Tallent [Tallent et al. 1996]. The researchers used
bR, a light-activated protein found in the membrane of the salt marsh microbe
Halobacterium salinarum (which uses it for photosynthesis), to coat a storage device
such as a DVD, which may increase the data storage capacity up to 50 TB
[Renugopalakrishnan et al. 2006] [Oesterhelt et al. 1991] [?]. bR converts light
energy into chemical energy. When light strikes it, it produces molecules in different
states, which remain in those states for a few minutes or days [Birge 1995]. Unlike
today's storage devices, the bR molecule, 2 nm in diameter, has demonstrated long-term
stability, with a shelf life of at least 10 years at room temperature, and is believed
to be stable at temperatures up to 140 degrees Celsius. The team modified the bR
protein to enhance its thermal and photochemical properties so that it can remain in
this intermediate state for a few years at high temperature
[Renugopalakrishnan et al. 2003]. Binary encoding follows the convention that a
protein in the bright state is read as 0 and a protein in the dark state as 1. They
worked on how to use charge-transporting proteins such as bacteriorhodopsin in
building data storage and transmission devices for applications in computer
technology. When a laser of one color is incident on the protein, it arranges itself
into one shape, designated as zero in the binary system, and a laser of another
wavelength stimulates the protein to take another shape, represented as one. Once the
laser system is switched off, data can be stored for several years. To read the stored
data, a low-power laser beam is passed slowly over the protein, so that the protein
conformation is not disturbed; only the light absorbed by the pattern in the protein
is detected by the machine, which generates a string of 0s and 1s. The ability of bR
to shift between intermediate states makes it a potential medium for rewritable data
storage.

A peptide-based storage device was implemented by integrating nanometer-scale
bio-organic nano dots into bio-electronic devices [Amdursky et al. 2013]. These
bio-organic nano dots, called peptide nano dots (PNDs), are 2 nm in size,
self-assemble from diphenylalanine (FF) peptides [Jeon et al. 2013] and can be
embedded into metal-oxide-semiconductor devices as charge storage nano-units in
non-volatile memory. FF self-assembles into nano tubes, and the structure is
stabilized by non-covalent interactions [Santhanamoorthi et al. 2011]. The micrometer
size of the FF tubes jeopardized their use in bio-electronic devices. However,
researchers observed that under anhydrous conditions the FF tubes disassemble into
stable building blocks, the PNDs, paving the way for their use in bio-electronic
devices. PNDs were successfully used as charge storage elements in non-volatile
memory (NVM) devices by replacing the ONO (oxide-nitride-oxide) dielectric in the NVM.
Other nano dots, such as Au and Pt, and organic dyes were employed earlier for the
same purpose, but the nano-crystalline structure, uniform nanometer size and
low-temperature deposition of PNDs make them superior. Two crucial steps were followed
to use FF PNDs in NVMs. First, the properties of the individual nano dots were
characterized using electron microscopy, and their electron diffraction pattern
confirmed the nano-crystalline structure. Next, a monolayer of PNDs was formed, which
serves as the memory stack and retains the charge. For the detailed procedure, the
reader is referred to [Amdursky et al. 2013]. Protein-based information storage
systems are at an infancy stage and open many challenges for intense research, such as
reading and writing the data, the data rate, the speed of data access and deciphering
the mutants of the photochromic proteins.


6. EXPERIMENTAL EVIDENCES AND CHALLENGES 

The research done so far undoubtedly demonstrates the potential of natural data
storage devices, but bringing them into commercial application requires addressing
many issues, such as the cost of storage, experimental challenges and the need for
human expertise. Today's storage devices can read data at 100 MB per second, which is
much higher than the data access rate of a natural hard drive. Although DNA is a
scalable, stable and robust storage medium, the synthesis and sequencing processes
involved are time consuming and require expertise, which puts DNA storage out of reach
of the ordinary user. However, development in next-generation sequencing techniques is
accelerating at a higher rate than in digital storage media, so it can be expected
that this technology will become cheaper in the near future [Schatz et al. 2010].
Decreasing the cost of synthesis and adopting parallel automation [Cleary et al. 2004]
and simplified purification techniques are the aspects to focus on. Improved error
correction schemes with high storage capacity are the theoretical challenges in the
field. Another limitation is the size of the DNA fragments used to store the data:
current synthesis and sequencing techniques can process only short DNA sequences.
Advances in sequencing and synthesis techniques can therefore help make DNA storage
more feasible in the coming era [Fuller et al. 2009] [Hogrefe et al. 2013].

Taking the first step towards the construction of a molecular data storage system,
researchers have made a paradigm shift in DNA reading and writing techniques by
proposing a DNA storage device that has reading and writing chambers
[Khulbe et al. 2005]. DNA readout experiments conducted in miniaturized chambers,
which can be an alternative to existing technology for DNA synthesis and sequencing,
have been explored. They adapted the DNA processing methods used by molecular
biologists to build the data storage chip. Strings of macromolecules (here DNA)
containing the bytes of information are created and, to secure the information, are
translocated to safety zones called parking spots. To read the data, they may be
transferred physically to decoding stations, and the data can be moved by controlled
electric-field gradients, electronic micro motors, etc. [Mansuripur 2005]. The
obstacles in this method are data access and the data rate, which is very low
(12 kbit/s) and has to be improved. For more details and the system architecture, the
reader is referred to [Khulbe et al. 2005]. The other important challenge is ease of
retrieval and random access: efficient random access and improved rewritable methods
are needed. Scaling the natural storage capacity is one of the important areas of
research for making molecular storage a commercial application.

For data stored in bacteria, bacterial culture and incubation require a lot of human
expertise to avoid contamination, which raises an issue of data security. This
methodology therefore has to deal with data security and with the knowledge base
needed to handle the population of bacteria used as the medium to encode data.
Moreover, there is much low-hanging fruit in the area of encoding and decoding
algorithms for storing data in bacteria. With the advent of synthetic biology,
scientists at the J. Craig Venter Institute have synthesised an artificial bacterial
cell with a synthetic chromosome and watermarked data into the living cell of the
bacterium Mycoplasma mycoides, which is capable of self-replication
[Gibson et al. 2010]. Innovations like this motivate the dream of bacterial data
storage.

As far as protein is concerned, there is a lot to be explored. There is very little
evidence demonstrating the potential of protein as a storage medium, but the work
mentioned above makes headway towards new milestones. Breakthroughs in programmable
protein synthesis that may replicate the nature of a computer hard drive are still a
distant vision in the domain. An efficient algorithm that can map data to amino acid
sequences is one of the most challenging parts of making this possible. Table II gives
an executive summary of natural data storage devices and their challenges.
Table II. Executive summary of natural data storage devices and their evidence

| Source | Encoding and decoding algorithm | Storage capacity achieved | Properties | Experimental validation |
|---|---|---|---|---|
| DNA | Yes, effective Huffman base-3 encoding scheme | 490 EB (exabytes) per gram | Rewritable, scalable, stable under extreme conditions, dense | Yes, data was stored and retrieved with 100 percent accuracy |
| Bacteria | Yes, but preliminary: data encoded to codes and mapped to the bacterial genome | 450 2,000-gigabyte hard disks per gram of bacteria | Rewritable, secured, dense, high duplication rate | Yes (as mentioned in section Bacterial hard drive) |
| Protein | Yes, effective: data encoded to colors; proteins with two-state systems (dark-bright; green-red; or both for 4-base data encoding) are used to mimic the binary storage system of a computer | Yet to be uncovered | Long-term storage, secured, stable | Yes, as described above in section Protein hard drive |

7. CONCLUSION 

With this explosion in the amount of data, natural storage seems to be a solution for
preserving data as an archive over long periods. Given the challenges facing natural
data storage, it will not immediately replace computer storage drives. Nevertheless,
with the advancement of synthetic biology technologies, the day is near when this
dream will come true. The main focus of this area can be on data security, improved
encoding and decoding approaches, making the technology cost effective and, looking
further ahead, developing proteins and bacterial molecules to generate the components
of a computer using a bottom-up approach. The quest to improve existing energy-hungry
storage devices has led researchers to turn their attention to replacing them with
eco-friendly storage devices in the coming decade. Though there is much room for the
development of robust natural storage devices, one can imagine a near future where the
technology will allow computers across the internet to exchange information on their
own, self-replicate the information, and even mutate or improve the content and
correct errors on their own.